
Analysing NLP publication patterns

Recently, I got curious about how much different institutions publish in my area. Does Google publish more than Microsoft? Which university has the strongest publication record in NLP? And are there any interesting trends to be seen in recent years? Quantity does not necessarily equal quality, but the number of publications is still a reasonable indicator of general activity in the field, of how big a research group is, and of how outward-facing its research projects are.

My approach was to crawl papers from the 6 biggest conferences that are relevant to my research: ACL, EACL, NAACL, EMNLP, NIPS, ICML. The first 4 focus on NLP applications regardless of methods, and the latter 2 on machine learning algorithms regardless of tasks. The time window was restricted to 2012-2016, as I’m more interested in current publications.

Luckily, all of these conferences have nice webpages listing the papers published there. The ACL Anthology contains records for ACL, EACL, NAACL and EMNLP, NIPS has a separate webpage for papers, and ICML proceedings are on the JMLR website (except for ICML 2012, which is on the conference website). I wrote Python scripts that crawled all the papers from these conferences, extracting author names and organisations. While authors can be crawled directly from the websites, in order to find the organisation names I had to parse the PDFs into text and extract anything that looked like a university or company name in the first 30 lines of the paper. I wrote a bunch of manual patterns to map names to canonical versions (“UCL” to “University College London” and “Google Inc” to “Google”), although it is likely that I still missed some edge cases.
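
To give a flavour of that last step, here is a minimal sketch of the affiliation matching. It assumes the pdftotext command-line tool is available, and the pattern list and function name are illustrative examples rather than the exact rules behind the numbers below.

```python
import re
import subprocess

# A few illustrative patterns mapping raw affiliation strings to canonical
# names; the real list contains many more rules and edge cases.
CANONICAL_PATTERNS = [
    (re.compile(r"\bUniversity College London\b|\bUCL\b"), "University College London"),
    (re.compile(r"\bGoogle(\s+Inc\.?)?\b"), "Google"),
    (re.compile(r"\bCarnegie Mellon\b|\bCMU\b"), "Carnegie Mellon University"),
    (re.compile(r"\bMicrosoft( Research)?\b"), "Microsoft"),
]

def extract_organisations(pdf_path, max_lines=30):
    """Convert a PDF to plain text and match organisation names
    in the first `max_lines` lines (roughly the title/author block)."""
    text = subprocess.check_output(["pdftotext", pdf_path, "-"])
    header = "\n".join(text.decode("utf-8", errors="ignore").splitlines()[:max_lines])
    found = set()
    for pattern, canonical in CANONICAL_PATTERNS:
        if pattern.search(header):
            found.add(canonical)
    return sorted(found)
```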

Below is the graph of top 25 organisations and the conferences where they publish.
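
For anyone wanting to reproduce this kind of figure from their own crawl, here is a minimal sketch of a stacked per-conference bar chart. It assumes a hypothetical `papers` list of records (conference, year, and the set of organisations extracted from each paper); this is an illustration, not the exact code behind the graph.

```python
from collections import Counter, defaultdict
import matplotlib.pyplot as plt
import numpy as np

# `papers` is assumed to be the list of crawled records, one per paper, e.g.
# papers = [{"conference": "ACL", "year": 2014, "orgs": {"Google", "Carnegie Mellon University"}}, ...]

def plot_by_conference(papers, top_n=25):
    # Rank organisations by total paper count and keep the top N.
    totals = Counter(org for p in papers for org in p["orgs"])
    top = [org for org, _ in totals.most_common(top_n)]
    conferences = sorted({p["conference"] for p in papers})

    # Count papers per (organisation, conference) pair.
    counts = defaultdict(Counter)
    for p in papers:
        for org in p["orgs"]:
            counts[org][p["conference"]] += 1

    # Stack one bar segment per conference for each organisation.
    bottom = np.zeros(len(top))
    for conf in conferences:
        values = np.array([counts[org][conf] for org in top])
        plt.bar(range(len(top)), values, bottom=bottom, label=conf)
        bottom += values

    plt.xticks(range(len(top)), top, rotation=90)
    plt.ylabel("Number of papers")
    plt.legend()
    plt.tight_layout()
    plt.show()
```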

CMU comes out as the most prolific publisher with 305 papers. A close second is Microsoft with 302 publications, which also leads in the industry category. I was somewhat surprised to find that Microsoft publishes so much, almost twice as many papers as Google, especially as Google seems to get much more publicity for its research. Stanford rounds out the top 3, all of which publish substantially more than the rest. Edinburgh and Cambridge represent the UK camp with 121 and 117 papers respectively.

When we look at the distribution of conferences, Princeton and UCL stand out as having very little NLP-specific research, with nearly all of their papers in ICML and NIPS. Stanford, Berkeley and MIT also seem to focus more on machine learning algorithms. In contrast, Edinburgh, Johns Hopkins and the University of Maryland have most of their publications in NLP-related conferences. CMU, Microsoft and Columbia are the most balanced among the top publishers, with a roughly 50:50 split between NLP and ML.

We can also plot the number of publications per year, focusing on the top 15 institutions.
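
A minimal sketch of such a per-year plot, reusing the same hypothetical `papers` record format as above, might look like this:

```python
from collections import Counter
import matplotlib.pyplot as plt

def plot_per_year(papers, top_n=15):
    # Find the most prolific organisations overall.
    totals = Counter(org for p in papers for org in p["orgs"])
    top = [org for org, _ in totals.most_common(top_n)]
    years = sorted({p["year"] for p in papers})

    # One line per organisation, counting its papers in each year.
    for org in top:
        per_year = Counter(p["year"] for p in papers if org in p["orgs"])
        plt.plot(years, [per_year.get(y, 0) for y in years], marker="o", label=org)

    plt.xlabel("Year")
    plt.ylabel("Number of papers")
    plt.legend(fontsize="small")
    plt.show()
```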

Carnegie Mellon has a very good track record, but has only recently overtaken Microsoft as the top publisher. Google, MIT, Berkeley, Cambridge and Princeton have also stepped up their publishing game, showing upward trends in recent years. The sudden drop for 2016 is due to incomplete data – at the time of writing, the ACL, EMNLP and NIPS papers for this year are not available yet.

Now let’s look at the same graphs but for individual authors.

Chris Dyer comes out on top with 50 papers. This is even more impressive given that he started with just 2 papers in 2012 and then rocketed to the top by quite a margin in 2015. Almost all of his papers are in NLP conferences, with only 1 paper each in NIPS and ICML. Noah Smith, Chris Manning and Dan Klein rank 2nd-4th, with more stable publishing records, but also focus mainly on NLP conferences. In contrast, Zoubin Ghahramani, Yoshua Bengio and Lawrence Carin focus mostly on machine learning algorithms.

There seems to be a clear separation between the two research communities, with researchers specialising in publishing either in NLP or in ML venues. This is somewhat unexpected, especially considering the widespread trend of publishing novel neural network architectures for NLP tasks. Both fields would probably benefit from slightly tighter integration in the future.

I hope this little analysis was interesting to fellow researchers. I’m happy to post an update some time in the future, to see how things have changed. In the meantime, let me know if you find any bugs in the statistics.

Update: As requested, I’ve also added the statistics for first authors with highest publication counts. Jiwei Li from Stanford towers above others with 14 publications. William Yang Wang (CMU), Young-Bum Kim (Microsoft), Manaal Faruqui (CMU), Elad Hazan (Princeton), and Eunho Yang (IBM) have all managed an impressive 9 first-author publications.

Update 2: Added a fix for Jordan Boyd-Graber who publishes under Jordan L. Boyd-Graber in NIPS.

Update 3: Added a fix for Hal Daumé III, mapping together different spellings.

Update 4: By showing the top N authors on the graphs, some authors with equal numbers of publications were being excluded. I’ve adjusted the value of N for each graph so this doesn’t happen (one way of doing this is sketched after these updates).

Update 5: Added a fix for Pradeep K. Ravikumar who also publishes under Pradeep Ravikumar.

Update 6: Added fixes to capture name variations for INRIA.
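
As a side note on Update 4, below is a minimal sketch of one way to extend a top-N cutoff so that authors tied with the N-th entry are kept. The helper name is hypothetical; this illustrates the idea rather than the exact code behind the graphs.

```python
from collections import Counter

def top_with_ties(counts, n):
    """Return at least `n` entries, extending the cutoff so that anyone
    tied with the n-th entry is also included."""
    ranked = counts.most_common()
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]
    return [(name, c) for name, c in ranked if c >= cutoff]

# Example: with n=2, both authors on 3 papers are kept.
author_counts = Counter({"A": 5, "B": 3, "C": 3, "D": 1})
print(top_with_ties(author_counts, 2))  # [('A', 5), ('B', 3), ('C', 3)]
```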

Comments

    • Marek

      The full data on organisations is quite noisy at the lower ranks at the moment, as it is extracted from PDFs and then post-processed with manual rules. It still contains a long tail of alternative spellings and entries that are not institutions at all (e.g. College Park).
      Imperial College London comes up with 7 entries in there, although it’s worth noting that I’m only looking at 6 specific conferences, and Imperial seems to be publishing in somewhat different areas.

    • Marek

      Thanks! Indeed, I’m not catching alternative names for authors at the moment. I will update it soon and add a fix for your name.

  1. Jason Eisner

    How about including TACL? It’s a journal, but deliberately set up to be another mechanism for publishing normal ACL-style papers, so leaving it out of the analysis is strange. The format is essentially the same as ACL/NAACL/EMNLP/EACL, and you get to present the work at one of those conferences. Downloading and scraping the papers should be no different than for ACL. Whether you submit via TACL or directly via the conferences is as much a matter of when the deadlines fall as anything else. (Although TACL papers arguably should count a bit more: they generally get more thorough reviews, are often required to make revisions for final acceptance, and tend to be longer.)

    There’s also a question of whether long-form journal papers (JMLR, CL, etc.) should be included in measures of productivity. Perhaps those are often just synthesizing and expanding previously published conference papers? – but I’m not sure.

    Of course, I hope that no one optimizes for your ranking.

    • Marek

      I chose the 6 conferences simply based on which sources I personally follow the most. I completely agree that there are many other conferences and journals that could be included: TACL, COLING, CoNLL, *Sem, IJCAI, IJNLP, LREC, JMLR, CL, CIKM, AAAI, WWW, etc.
      I intend to post an update at the end of the year and will include a longer list of conferences. Feel free to suggest additional sources which I haven’t listed yet.

      • Wei Xu

        I second Jason. TACL is essentially equal to ACL/NAACL/EMNLP/EACL; it is quite different from COLING, CoNLL, *Sem, IJCAI, IJNLP, LREC, JMLR, CL, CIKM, AAAI, WWW, etc., and much closer to the centre of NLP research. I would recommend that anyone interested in NLP follow TACL papers just as closely (if not more closely) as ACL/NAACL/EMNLP/EACL.

  2. Ryan

    Thanks for the nice post, but some of the numbers seem off, and the errors may be related to parsing Chinese names. For example, Yuchen Zhang does not have an EMNLP, and there are at least two Yuxin Chen working in this area but neither of them has 7 ICML+NIPS alone. Perhaps you double-counted other people named Y. Zhang or Y. Chen?

  3. Jochen L Leidner

    Nice infographic, thanks! Immediate feature requests: How about patents? Including IR? Speech? Top single authors? Or which university fosters the most team co-authoring? Citation impact per institution?

  4. EXG

    Nice infographics! Quick comment: I believe INRIA is missing. Just by counting NIPS 2012-2015, I get more than 60 papers.

    • Marek

      Good point, thanks for letting me know. I’ve added a fix for mapping together different ways of naming INRIA. They are now featured in the top 25.

  5. John

    I wonder why you have included ICML and NIPS in your analysis. There is some spillover from ML into NLP and vice versa, but generally within the NLP community, only the big four (ACL, NAACL, EMNLP, and EACL) matter. The other two are really machine learning conferences and are not of that much interest to researchers in Computational Linguistics/NLP, so the data from NIPS and ICML are more like noise and don’t give you much information on current trends in the field.

    • Marek

      I chose the conferences that influence my work the most. Totally subjective, I agree. On the spectrum of linguistics-NLP-ML, I am more on the ML side.
